Abstract 

We present TT2NE, a new algorithm to predict RNA secondary structures with pseudoknots. 
The method is based on a classification of RNA structures according to their topological genus. 
TT2NE is guaranteed to find the minimum free energy structure irrespective of pseudoknot 
topology. This unique capability is obtained at the expense of the maximum sequence length 
that can be treated, but comparison with state-of-the-art algorithms shows that TT2NE is a 
very powerful tool within its limits. Analysis of TT2NE's wrong predictions sheds light on the 
need to study how steric constraints limit the range of pseudoknotted structures that can be 
formed from a given sequence. An implementation of TT2NE on a public server can be found at 
http://ipht.cea.fr/rna/tt2ne.php 
INTRODUCTION 



In the past twenty years, the interest of the biological community in RNA has increased 
tremendously. This biopolymer, which was at first considered merely as a simple 
information carrier, was gradually proven to be a major actor in the biology of the cell [1]. 
It was first discovered that some RNAs might have enzymatic activity (ribozymes) and as 
such would directly play a crucial role in the biochemical reactions taking place in the cell. 
More recently, it was also discovered that some RNAs, in particular micro-RNAs, have a 
post-transcriptional regulation role in the cell by controlling the level of translation of some 
messenger RNAs. Up to 30% of human genes might be regulated by such micro-RNAs. 
At present, it is also believed that a considerable amount of "junk" (non-coding) DNA is 
transcribed into non-coding RNAs, the role of which is still unclear. 

Since RNA functionality is mostly determined by its three-dimensional conformation, 
the accurate prediction of RNA folding from the base sequence is a central issue [2]. It is 
strongly believed that the biological activity of RNA (be it enzymatic or regulatory) is 
implemented through the binding of some unpaired bases of the RNA with their ligand. It 
is thus crucial to have a precise and reliable map of all the pairings taking place in RNA and 
to correctly identify loops. The complete list of all Watson-Crick and wobble base pairs in 
an RNA is called its secondary structure. 

Since the folding of even short RNA molecules takes too long to perform with all-atom 
simulations including explicit solvent, the more modest goal of solely obtaining the 
most probable secondary structures based on experimentally derived base-pairing and 
base-stacking free energies has been pursued. It seems very plausible that (as in NMR 
protein structure prediction) the secondary structure of RNAs is sufficiently constraining to 
entirely and unambiguously determine the 3-dimensional structure of the molecule. This 
3-dimensional structure of the RNA in turn controls the biochemistry of the molecule, by 
making certain regions of its surface accessible to the ligand molecule. 

In this paper, we will adhere to the notion that there is an effective free energy which governs 
the formation of secondary structures, so that the optimal folding of an RNA sequence 
is found as the minimum free energy structure (MFE for short). The problem of finding the 
MFE structure given a certain sequence has been conceptually solved provided the MFE is 
planar, i.e. the MFE structure contains no two pairs (i, j) and (k, l) such that i < k < j < l. In that 
case, polynomial algorithms which can treat long RNAs assuming a mostly linear free energy 
model have been found [3-5]. Otherwise, the MFE structure is said to contain pseudoknots 
and finding it has been shown to be an NP-complete problem with respect to the sequence 
length [6]. Even if pseudoknots represent a small part of known structures, they often have 
a functional role [7, 8] and the problem of their prediction must be addressed. 

Three main algorithmic strategies can be thought of to cope with the NP-completeness 
of pseudoknotted MFE prediction: 1) empirical search of the MFE using 
heuristic methods, 2) efficient exact calculations on a restricted class of pseudoknots and 
3) exact calculations, using various tricks to allow for the treatment of sequences that are 
as long as possible. 

Here we present TT2NE, an algorithm that falls into the last category. TT2NE relies 
on the "maximum weighted independent set" (WIS) formalism. In this formalism, an RNA 
structure is viewed as an aggregate of stem-like structures (helices, possibly containing 
bulges of size 1 or internal loops of size 1 x 1). These stem-like structures can be viewed as 
points in the space of all helical fragments available from a given sequence and we will refer 
to them as "helipoints". Note that our notion of helipoints is not trivial and 
differs from what is done in other algorithms based on the WIS formalism, where they generally 
reduce to maximal helices (see the explanation in Materials and Methods). Given a certain 
sequence, the set of all possible helipoints is computed and a weighted graph is built in the 
following way: 

• the vertices of the graph are the helipoints, with a weight given by the opposite of 
their free energy of formation, 

• two vertices are connected by an edge if and only if the corresponding helipoints are 
not compatible in the same secondary structure. 

Indeed, two helipoints may be mutually exclusive in a graph: this is for example the case 
if they share at least one base (since triplexes are forbidden). Finding the MFE structure 
thus amounts to finding the maximum weighted independent set of the graph, i.e. the set 
of pairwise compatible helipoints such that the overall free energy is minimum. 
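
The graph construction above can be illustrated with a minimal Python sketch. The three helipoints, their energies and all function names are made up for illustration, and only the "shared base" incompatibility rule is modelled; the brute-force enumeration is there only to show that the MFE coincides with the maximum weighted independent set on a toy case, not to reproduce TT2NE's search.

```python
from itertools import combinations

# Hypothetical helipoints: (name, bases involved, free energy in kcal/mol).
HELIPOINTS = [
    ("h1", frozenset({1, 2, 3, 20, 21, 22}), -5.0),
    ("h2", frozenset({3, 4, 5, 12, 13, 14}), -4.0),
    ("h3", frozenset({8, 9, 10, 15, 16, 17}), -3.0),
]


def build_conflict_graph(helipoints):
    """Return the edges (i, j), i < j, joining incompatible helipoints.

    Assumption: two helipoints conflict only if they share a base.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(helipoints)), 2)
        if helipoints[i][1] & helipoints[j][1]
    }


def is_independent(subset, edges):
    """True if no two vertices of `subset` are joined by an edge."""
    return not any((i, j) in edges for i, j in combinations(sorted(subset), 2))


def brute_force_mfe(helipoints, edges):
    """Enumerate every independent set and keep the lowest free energy."""
    best, best_energy = (), 0.0  # the empty structure has zero free energy
    n = len(helipoints)
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if is_independent(subset, edges):
                energy = sum(helipoints[i][2] for i in subset)
                if energy < best_energy:
                    best, best_energy = subset, energy
    return best, best_energy


edges = build_conflict_graph(HELIPOINTS)
best_set, best_energy = brute_force_mfe(HELIPOINTS, edges)
```

In this toy case h1 and h2 share base 3, so the only edge joins them, and the lowest free energy is reached by combining h1 and h3.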

Given a certain sequence $x$, let $N_x$ denote the number of available helipoints and $Q_x$ the 
associated graph. The base routine of TT2NE is a simple exhaustive depth-first exploration of all 
independent sets of $Q_x$ using a backtracking procedure, where vertices are added to the current 
structure in increasing order of their free energy, that is in decreasing order of weight 
(see the pseudocode in Fig. [1]). In particular, there is no restriction on the pseudoknot 
topologies that TT2NE can generate. However, this strategy is very inefficient. In this article 
we propose two ideas to improve it. First, we use a new treatment of pseudoknots that 
restricts TT2NE's search to a much smaller and relevant subspace of independent sets. Second, 
we take advantage of a peculiar energy model to enforce a branch-and-bound procedure 
that speeds up the search of the MFE without loss of exactness. A server implementation 
of TT2NE can be found at http://ipht.cea.fr/rna/tt2ne.php 

Initialization of global variables 

    Current structure S_c = ∅ 
    Current free energy ΔF_c = 0 
    Current minimum free energy structure S_cm = ∅ 
    Current minimum free energy ΔF_cm = 0 

Procedure TT2NE 

    for i = 1, N_x 
        Recursive_exploration(i) 
    end for 

Procedure Recursive_exploration(i) 

    (0)  if (ΔF_c + ΔF_min(i) > ΔF_cm) exit 
    (1)  Test of compatibility between S_c and h_i 
         if (h_i conflicts with S_c) exit 
    (2)  Addition of h_i to S_c and update of ΔF_c 
         S_c = S_c ∪ h_i ;  ΔF_c = ΔF(S_c) 
    (2b) if (genus(S_c) > g_max) go to step (5) 
    (3)  Is the current structure the best one found so far? 
         if (ΔF_c < ΔF_cm) then S_cm = S_c ;  ΔF_cm = ΔF_c 
    (4)  Recursive expansion of S_c with less stable helipoints 
         for j = i + 1, N_x 
             Recursive_exploration(j) 
         end for 
    (5)  Backtrack 
         S_c = S_c - h_i ;  ΔF_c = ΔF(S_c) 

FIG. 1: Pseudocode of TT2NE. The base routine performs an 
exhaustive enumeration of all independent sets of $Q_x$. In the end, the MFE structure can be 
read in the global variable S_cm and its free energy in ΔF_cm. Lines (0) and (2b), shown in red 
in the original figure, are the two improvements discussed in the text. 
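
As a rough illustration of the base routine (steps (1) to (5) of the pseudocode, without the two improvements), here is a minimal Python sketch. It assumes a purely additive free energy and a conflict relation given as a dictionary of sets; the function name `explore_all` and the data layout are illustrative choices, not TT2NE's actual implementation.

```python
def explore_all(energies, conflict):
    """Exhaustively enumerate independent sets by backtracking.

    energies: helipoint free energies sorted in increasing order
              (most stable helipoint first).
    conflict: dict mapping each index to the set of indices it conflicts with.
    Returns (best_structure, best_energy) for an additive free energy.
    """
    n = len(energies)
    best = {"structure": [], "energy": 0.0}
    current = []  # indices of the helipoints in the current structure

    def recurse(i, energy):
        # (1) compatibility test between the current structure and h_i
        if any(j in conflict[i] for j in current):
            return
        # (2) add h_i and update the current free energy
        current.append(i)
        energy += energies[i]
        # (3) keep the structure if it is the best one found so far
        if energy < best["energy"]:
            best["structure"], best["energy"] = list(current), energy
        # (4) expand with less stable helipoints
        for j in range(i + 1, n):
            recurse(j, energy)
        # (5) backtrack
        current.pop()

    for i in range(n):
        recurse(i, 0.0)
    return best["structure"], best["energy"]


# Toy example: helipoints 0 and 1 are mutually exclusive.
toy_result = explore_all([-5.0, -4.0, -3.0], {0: {1}, 1: {0}, 2: set()})
```

Because every independent set is visited, the cost grows exponentially with the number of helipoints, which is why the genus limit and the branch-and-bound test matter in practice.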
A new treatment of pseudoknots 

In a previous series of studies [9, 10], we have proposed a classification of pseudoknots 
according to their topological genus. The genus is an integer that captures the 
complexity of a pseudoknot and we have shown that naturally occurring pseudoknots have 
a much lower genus than expected in randomly paired polymers [10]. In particular, we have 
shown that for sequences of sizes up to 500 bases, the genus does not exceed 2. For sizes 
around 1500 bases, the genus ranges between 2 and 6. Finally, for the largest RNAs (around 
3000 bases) the genus may reach 17. 

We use this fact to guide TT2NE's search of relevant pseudoknots in two ways. First, a 
penalty for pseudoknot formation depending on their genus is introduced in the free energy 
model. Although more sophisticated forms could be imagined, for now we chose a simple 
linear form. A pseudoknot of genus $g$ is assigned a penalty $+\mu g$ where we set $\mu$ to +1.5 
kcal/mol. This value of $\mu$ was obtained by optimizing the number of structures correctly 
predicted by our algorithm. Second, an upper limit $g_{\max}$ is introduced. This limit, tunable 
by the user, is of critical importance as it defines the space of pseudoknots to which TT2NE 
restricts its search. The size of this space grows exponentially with $g_{\max}$, so this number 
has a great impact on the computational time required by TT2NE. Based on the relation 
of RNA size to genus mentioned above, we may safely fix a maximum genus of 3 for RNA 
sizes smaller than 250, typically the maximal size we can treat with our present algorithm 
due to computational time constraints. 

We have shown that the most standard pseudoknots, i.e. the H-pseudoknot and the 
kissing hairpin, both have genus 1. This implies that if one is interested in short chains which 
carry this kind of pseudoknot, setting $g_{\max}$ to 1 is sufficient and saves a lot of 
computational time. Setting $g_{\max}$ to a large value would leave the problem as open as 
possible, but a wise tuning of this parameter proves a relevant and efficient way to 
locate the MFE quickly. 

A branch-and-bound procedure 

The base routine of TT2NE can be improved using a branch-and-bound procedure. The 
idea is to speed up the search of the MFE of $Q_x$ by first computing the MFE of some relevant 
subgraphs. The crux of such a branch-and-bound procedure is to be able to relate those 
partial solutions to the general problem, and this can be done in TT2NE by taking advantage 
of a peculiar energy model. 

Energy model 

Vertices are sorted in increasing order of free energy, i.e. vertex 1 represents the most 
favorable helipoint. We denote by $\Delta F_i$ the free energy of the $i$-th vertex. Then in TT2NE the free 
energy of a structure $S$ made of helipoints $\{h_i\}_{i \in \Pi(S)}$ is computed with the following model 
$M_1$: 

$$\Delta F^{M_1}(S) = \sum_{i \in \Pi(S)} \Delta F_i + \mu_m \, n_m(S) + \mu \, g(S) \qquad (1)$$

where $n_m(S)$ is the number of multibranch loops of $S$, $\mu_m$ is the corresponding penalty 
of formation and $g(S)$ is the genus of $S$. Note that in this model there is no term for large 
internal loops or bulges. We also introduce the simple model $M_0$ where the free energy of $S$ 
is just the sum of the free energies of the helipoints it is made of: 

$$\Delta F^{M_0}(S) = \sum_{i \in \Pi(S)} \Delta F_i \qquad (2)$$
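
To make the two models concrete, the following Python sketch evaluates equations (1) and (2) for a hypothetical structure. The genus penalty of 1.5 kcal/mol is the value quoted in the text; the multibranch penalty defaults to 0 here, as in the work presented (see Materials and Methods); the function names and the example numbers are assumptions made for illustration.

```python
MU = 1.5  # genus penalty in kcal/mol, value quoted in the text


def energy_m0(helipoint_energies):
    """Model M0: plain sum of the helipoint free energies."""
    return sum(helipoint_energies)


def energy_m1(helipoint_energies, n_multibranch, genus, mu_m=0.0, mu=MU):
    """Model M1: M0 plus multibranch-loop and genus penalties."""
    return energy_m0(helipoint_energies) + mu_m * n_multibranch + mu * genus


# Hypothetical H-pseudoknot made of two helipoints (genus 1, no multiloop).
example_m0 = energy_m0([-5.0, -3.0])
example_m1 = energy_m1([-5.0, -3.0], n_multibranch=0, genus=1)
```

With these numbers the additive model gives -8.0 kcal/mol and the genus penalty raises it to -6.5 kcal/mol. Since both penalties are non-negative, the M1 energy of a structure is never lower than its M0 energy.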

Property 

Let $\Delta F_{\min}(i)$ be the MFE of structures comprised of helipoints with indices larger than 
or equal to $i$, according to the energy model $M_0$. $\Delta F_{\min}(i)$ is simply the output of TT2NE 
when used on the restriction of $Q_x$ to its $N_x - i + 1$ last vertices with model $M_0$. Let $S^0$ be a 
structure made of $n$ helipoints and $i_n$ the index of its least stable helipoint. Let $S_{/k}$ denote the 
restriction of a structure $S$ to its $k$ most stable helipoints. Then it can be straightforwardly 
shown that the following property holds: 

$$\forall S, \quad S_{/n} = S^0 \;\Rightarrow\; \Delta F^{M_j}(S) \geq \Delta F^{M_j}(S^0) + \Delta F_{\min}(i_n + 1) \quad \text{for } j = 0 \text{ or } 1 \qquad (3)$$

The practical meaning of this relation is the following: there is a lower limit to the free energy of all 
structures that can be derived from $S^0$ by adding any combination of helipoints of indices 
larger than $i_n$. Consequently, if this lower limit is found to be larger than the current MFE 
that TT2NE has found so far, TT2NE can safely ignore all these structures: the global 
MFE cannot be found in this ensemble. This property thus makes it possible to further restrict the 
size of the search space for the MFE. 

Those two improvements can be incorporated in TT2NE as can be seen in red in Fig. [1]. 



MATERIALS AND METHODS 

Efficient calculation of the genus 

TT2NE needs to efficiently update the genus of a structure upon addition 
or removal of a helipoint. In order to do so, we use a technique which was introduced by 
't Hooft [11]. An RNA structure is represented as a diagram whose arches are double lines 
that connect paired bases, as shown in Fig. [2]. 




(a) P = 5, L = 5 -> g = 0        (b) P = 5, L = 3 -> g = 1 

FIG. 2: Examples of how to calculate the genus with a double-line diagram representation. 

In this process, loops are created within those diagrams and it can be shown that the 
genus of the corresponding structures can simply be calculated with: 

$$g = \frac{P - L}{2} \qquad (4)$$

where $P$ is the number of pairs and $L$ the number of loops. Upon addition of a new pair to 
a structure, the genus variation $\Delta g$ is given by 

$$\Delta g = \frac{1 - \Delta L}{2} \qquad (5)$$

We found a property that allows the term $\Delta L$ to be calculated efficiently. Upon addition 
of a pair $(i, j)$ to a certain diagram, 

$$\Delta L = \begin{cases} +1 & \text{if } i \text{ and } j \text{ belong to the same loop} \\ -1 & \text{otherwise} \end{cases} \qquad (6)$$

Therefore, $\Delta g$ can be straightforwardly calculated by checking whether the newly paired 
bases belong to the same loop and this operation can be efficiently performed in a time 
linear in the number of pairs of the diagram. The case of the removal of a pair is symmetric. 
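
Equation (4) can be checked on small diagrams with the following Python sketch. It is not the incremental update described above: it recomputes the genus from scratch by following both lines of every double-line arch and counting the closed loops. The representation (each pair endpoint gets a "left side" and a "right side" node) and the function name `genus` are assumptions made for this illustration.

```python
def genus(pairs):
    """Genus of a pairing diagram given as a list of (i, j) pairs, i < j.

    Every paired position gets two nodes (its left and right side).
    Backbone segments and the two lines of each arch link these nodes;
    closed loops are the connected components that never reach an open
    end of the chain.
    """
    ends = sorted(pos for pair in pairs for pos in pair)
    rank = {pos: k for k, pos in enumerate(ends)}
    m = len(ends)
    # Node 2k is the left side of endpoint k, node 2k + 1 its right side.
    neighbours = {v: [] for v in range(2 * m)}

    def link(u, v):
        neighbours[u].append(v)
        neighbours[v].append(u)

    for k in range(m - 1):
        link(2 * k + 1, 2 * (k + 1))  # backbone between consecutive endpoints
    for i, j in pairs:
        a, b = rank[i], rank[j]
        link(2 * a, 2 * b + 1)        # outer line of the arch
        link(2 * a + 1, 2 * b)        # inner line of the arch

    seen, loops = set(), 0
    for start in range(2 * m):
        if start in seen:
            continue
        seen.add(start)
        stack, closed = [start], True
        while stack:
            u = stack.pop()
            if len(neighbours[u]) < 2:  # dangling node: open end of the chain
                closed = False
            for v in neighbours[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        loops += closed
    return (len(pairs) - loops) // 2


h_pseudoknot = genus([(1, 3), (2, 4)])
kissing_hairpin = genus([(1, 5), (3, 9), (7, 11)])
nested_helix = genus([(1, 10), (2, 9), (3, 8)])
```

On these toy diagrams the sketch returns genus 1 for the H-pseudoknot and for the kissing hairpin, and genus 0 for the nested helix, in agreement with the values stated in the text.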



Generation of the initial graph 

A helipoint is an ensemble of helices that share the same extremal pairs. Given two 
extremal pairs $(i, j)$ and $(k, l)$, the set $\mathcal{H}_{ij}^{kl}$ of all helices that end with these two pairs can be 
generated and their individual energies calculated according to a given energy model. The 
free energy $F_{ij}^{kl}$ of the helipoint is then computed as 

$$\exp\left(-\beta F_{ij}^{kl}\right) = \sum_{h \in \mathcal{H}_{ij}^{kl}} \exp\left(-\beta E(h)\right) \quad \text{with } \beta = (k_B T)^{-1} \qquad (7)$$

Helipoints are stem-like structural building blocks which account for all possible internal 
pairing possibilities that occur between their extremal pairs. The importance of this 
notion is well captured by considering for example the following sequence: GGGAGGG [...] 
CCCUUCCC. As one can see, a helix containing a "bulged" uracil can be formed from 
this sequence, but there are two ways to choose the "bulged" uracil. In order to describe 
this fact appropriately in statistical mechanics, it is important neither to neglect any of 
these possibilities nor to consider them as distinct competitors. Rather, the notion of 
helipoint implies that both possibilities stabilize the pairing of these regions of the 
sequence. In this example, the calculation of the free energy according to equation (7) 
indeed introduces an entropic bonus of $-k_B T \ln 2$ that accounts for this variability. 
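
The entropic bonus can be verified numerically with a short Python sketch of equation (7). The temperature of 310.15 K (37 °C) and the helix energy of -4.0 kcal/mol are assumptions chosen for illustration, and the function name is hypothetical; the sum is evaluated relative to the lowest energy for numerical stability.

```python
import math

K_B = 0.0019872  # Boltzmann constant per mole, in kcal/(mol K)


def helipoint_free_energy(helix_energies, temperature=310.15):
    """Free energy F with exp(-beta F) = sum over helices of exp(-beta E)."""
    beta = 1.0 / (K_B * temperature)
    e_min = min(helix_energies)
    # Factor out exp(-beta * e_min) so that the exponentials stay bounded.
    partition = sum(math.exp(-beta * (e - e_min)) for e in helix_energies)
    return e_min - math.log(partition) / beta


# Two equivalent ways of bulging a uracil, each with the same helix energy.
single = helipoint_free_energy([-4.0])
double = helipoint_free_energy([-4.0, -4.0])
bonus = double - single  # expected to equal -k_B T ln 2
```

At this temperature the bonus is about -0.43 kcal/mol, so the helipoint is slightly more stable than either of its two helices taken alone.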



The computation of helipoint free energies requires setting values for the 
basic structural elements of RNA folds: stacking, terminal mismatches, helix formation 
penalty, bulges and internal loops. The first three families of terms have been taken from 
[12]. We computed the free energy of a bulge of size one as the energy of the stack of pairs 
closing this bulge plus 3.8 kcal/mol. The energy of a helix comprising a 1 x 1 internal loop 
is computed as the sum of the free energies of the two helices delimited by this internal loop 
minus 3.85 kcal/mol. Larger internal loops and bulges of size more than one were not taken 
into account. In particular, helipoints do not include such motifs. The multibranch 
loop formation penalty was not used (i.e. set to 0) in the work presented here, even though 
TT2NE could handle it. All helipoints of favorable (i.e. negative) free energy were kept 
to build the graph. Note that in most other algorithms based on the WIS formalism, only 
maximal favorable helices are kept (i.e. helices such that the outer nearest neighbors of their 
extremal pairs cannot pair). Our choice not to restrict our algorithm to maximal helipoints 
makes the problem harder since it makes the graph larger, but the reason will be explained 
in the discussion part below. 

Two helipoints were considered incompatible (i.e. they are connected in the graph) if: 

• they overlap, 

• their concatenation generates an existing helipoint, 

• their concatenation produces a sterically impossible structure. 

This last requirement anticipates a point that will be explained in the "discussion" 
section. 

Branch-and-bound procedure 

Equation (3) requires a prior computation of the terms $\Delta F_{\min}(i)$, that is the MFE 
of $Q_x$ restricted to helipoints of index larger than or equal to $i$. Those quantities are obtained 
by running TT2NE on those subgraphs. However, calculating those terms for all $i$ is 
pointless since the only quantity ultimately needed is $\Delta F_{\min}(1)$. Rather, one must choose a certain 
level up to which these terms should be calculated, in order to get a good balance 
between the time spent in doing so and the time saved later in the search of the MFE. 
In the work presented here, we generally computed the quantities $\Delta F_{\min}(i)$ for the 
350 least stable helipoints. 
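
The interplay between the precomputed bounds and the main search can be sketched in Python as follows. This is a simplified illustration under stated assumptions: only the additive model M0 is used, the genus test is omitted, and indices for which no bound has been precomputed simply get a bound of minus infinity (no pruning). The function name and the parameter `n_bounds` (the number of least stable helipoints for which bounds are computed) are hypothetical.

```python
import math


def search_with_bounds(energies, conflict, n_bounds):
    """Backtracking search pruned with precomputed suffix bounds.

    energies: helipoint free energies in increasing order (most stable first).
    conflict: dict mapping each index to the set of conflicting indices.
    n_bounds: number of least stable helipoints whose bound is precomputed.
    Returns (mfe, structure).
    """
    n = len(energies)
    # f_min[i]: lowest free energy reachable with helipoints i..n-1 only.
    # Unknown entries are -inf, which never triggers the pruning test.
    f_min = [-math.inf] * n + [0.0]

    def search(start):
        best_energy, best_set = 0.0, []
        current = []

        def recurse(i, energy):
            nonlocal best_energy, best_set
            if energy + f_min[i] > best_energy:         # (0) bound
                return
            if any(j in conflict[i] for j in current):  # (1) compatibility
                return
            current.append(i)                           # (2) add h_i
            energy += energies[i]
            if energy < best_energy:                    # (3) new best?
                best_energy, best_set = energy, list(current)
            for j in range(i + 1, n):                   # (4) expand
                recurse(j, energy)
            current.pop()                               # (5) backtrack

        for i in range(start, n):
            recurse(i, 0.0)
        return best_energy, best_set

    # Precompute the bounds from the least stable helipoints upwards,
    # reusing the bounds already known for larger indices.
    first = max(n - n_bounds, 0)
    for start in range(n - 1, first - 1, -1):
        f_min[start] = search(start)[0]
    return search(0)


toy_conflict = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
toy_energies = [-5.0, -4.0, -3.0, -2.0]
pruned = search_with_bounds(toy_energies, toy_conflict, n_bounds=2)
unpruned = search_with_bounds(toy_energies, toy_conflict, n_bounds=0)
```

Both calls return the same minimum, which reflects the fact that the bound only discards branches that cannot contain the MFE; the choice of `n_bounds` only changes how much work is spent before and during the main search.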
Suboptimal structures 

The algorithm presented here only outputs the MFE. It is very easy to adapt it to 
output instead a user-specified number of suboptimal structures if 
needed. 

Heuristic 

For longer sequences, a heuristic can be used: the above techniques are first applied 
to the restriction of the graph to its $N_h$ most stable helipoints and the best structures 
output are then saturated with the remaining helipoints. This heuristic is identical to 
the initial problem when $N_h = N_x$ and becomes more and more imprecise as $N_h/N_x \to 0$. 

Detailed results 

We compared TT2NE with McQfold [13], HotKnots [14] and Mfold [15] on a set of 
35 sequences which is quite similar to the set used in the original HotKnots paper. 
We did not compare it with the Pknots algorithm of Rivas and Eddy [16] as its 
computation time is very long (it scales like the 6th power of the length of the 
sequence). Sequences were mostly retrieved from Pseudobase [17] and are named 
after their Pseudobase entry, with the exception of the sequence "1u8d" which is 
named after its PDB entry. For each sequence, sensitivity and positive predictive value 
(PPV) have been measured. The sensitivity is defined as the fraction of pairs of the 
native structure that are correctly predicted. The PPV is defined as the fraction of 
pairs of the predicted structure that are correct. Both are given in % in the following 
table, in the form "sensitivity - PPV". Stars mark sequences where the correct structure is 
actually the second best prediction. For each sequence, the best predicted sensitivity 
is emphasized in boldface. In all those tests, TT2NE's parameter $g_{\max}$ was set to 3. 
Sequence     length  genus  Mfold       HotKnots    McQfold     TT2NE       genus TT2NE
1u8d         68      1      69 - 100    69 - 100    69 - 100    88 - 100    1
AMV3         113     1      84 - 86     84 - 86     76 - 81     87 - 85     1
BBMV         116     1      -           81 - 81     86 - 82     86 - 84     1
Bp_PK2       91      1      81 - 96     81 - 96     87 - 87     100 - 100   1
BVDV         74      1      52 - 65     52 - 61     76 - 82     96 - 96     1
BWYV         51      1      55 - 55     100 - 69    55* - 55    100 - 100   1
Bt-PrP       45      1      41 - 33     41 - 38     50 - 40     50 - 35     1
CcTMV        73      3      23 - 27     23 - 27     57 - 93     42 - 52     0
CGMMV        85      3      58 - 69     67 - 87     38 - 48     58 - 72     0
CoxB3        73      1      68 - 89     68 - 89     92 - 100    92 - 100    1
Ec_alpha     108     1      45 - 29     45 - 29     50 - 37     79 - 61     1
Ec_PK1       31      1      0 - 0       100 - 90    100 - 90    100 - 90    1
Ec_PK4       52      1      -           68 - 100    52 - 71     100 - 100   1
Ec-Rpml      72      1      68 - 90     20 - 26     51 - 71     58 - 60     1
Ec_S15       67      1      58 - 62     100 - 73    58* - 62    100 - 73    1
GLRaV-3      75      1      65 - 59     65 - 59     100 - 76    100 - 76    1
HAV          55      1      58 - 83     58 - 83     58 - 83     58* - 83    0
HCV_229E     74      1      79 - 100    79 - 100    100 - 100   100 - 100   1
HDV          87      2      65 - 70     41* - 44    75 - 75     93 - 84     2
HDV_anti     91      2      16 - 14     16* - 14    100 - 80    72 - 58     2
Hs_PrP       45      1      -           0 -         54 - 42     0 - 0
IBV          56      1      55 - 66     100 - 100   94 - 100    94 - 100    1
Lp_PK1       31      1      50 - 100    50* - 100   50 - 100    50* - 100
Mengo-PKC    26      1      37 - 60     -           37 - 60     100 - 100   1
minimalIBV   45      1      64 - 91     100 - 94    100 - 94    100 - 94    1
MMTV         34      1      -           100 - 91    100 - 91    100 - 91    1
pKA-A        36      1      50 - 66     100 - 92    100 - 92    100 - 92    1
RSV          128     1      74 - 76     97 - 82     100 - 95    94 - 88     1
satRPV       73      1      59 - 68     59 - 68     81 - 81     81 - 81     1
SRV-1        38      1      0 - 0       100 - 100   100 - 100   100 - 100   1
T2 gene32    33      1      58 - 70     100 - 100   100 - 100   100 - 100   1
T4 gene32    28      1      63 - 87     63* - 87    63 - 100    100 - 100   1
TMV          74      3      52 - 65     52 - 61     52 - 65     48 - 54     1
Tt-LSU       65      1      60 - 75     95 - 100    60 - 100    95 - 100    1
TYMV         74      1      72 - 78     70 - 73     72 - 78     72 - 69     1
average                     54 - 59     65 - 70     75.5 - 80   82 - 81


On average, TT2NE achieves the best performance on this set of test sequences. 
Comparison with HotKnots shows that these improvements originate from the 
different treatment of pseudoknots, as HotKnots and TT2NE otherwise use essentially 
the same energy model. 
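
The two scores used in this section can be computed as in the following Python sketch, written from the definitions given above. The function name and the toy structures are assumptions for illustration; the example mimics a prediction that recovers every native pair but adds one spurious pair.

```python
def sensitivity_ppv(native_pairs, predicted_pairs):
    """Return (sensitivity, PPV) in percent for two lists of base pairs."""
    native = {tuple(sorted(p)) for p in native_pairs}
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    correct = len(native & predicted)
    sensitivity = 100.0 * correct / len(native) if native else 0.0
    ppv = 100.0 * correct / len(predicted) if predicted else 0.0
    return sensitivity, ppv


# Hypothetical native structure with 10 pairs; the prediction contains
# all of them plus one extra pair that is absent from the native structure.
native = [(i, 30 - i) for i in range(1, 11)]
predicted = native + [(11, 19)]
sens, ppv = sensitivity_ppv(native, predicted)
```

In this example the sensitivity is 100% while the PPV drops to 10/11, i.e. about 91%: a single spurious pair leaves the sensitivity untouched and only lowers the PPV.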

Comments and discussion 

Despite the fact that TT2NE can find any type of topology and is guaranteed to output 
the MFE, it does not achieve a 100% success rate. Why is that so? We have investigated 
the errors generated by TT2NE and we see two main causes: the first relates to 
the limits of the energy model used and the second is more specific to the nature of 
pseudoknots. 
Limits due to the free energy model 

The Turner free energy model has been shown to be partly unable to explain planar 
secondary structures [18]. TT2NE uses only a subset of this model: thus, there are 
errors coming from the part of this model we use, and others coming from the part we 
do not use. 

An example of the first case is provided by the sequence satRPV: the native 
secondary structure is almost correctly predicted, but an error is made because the 
helix 

    2 CAGA 
      GUCU 19 

is considered more thermodynamically favorable than the native one 

    1 ACAG 
      CUGU 16 

An example of the latter case can be seen with Ec-Rpml. There, the native structure 
contains a helipoint containing a 2 x 1 internal loop. The thermodynamic properties of 
2 x 1 internal loops are not properly taken into account in TT2NE. As a consequence, 
the energy of formation of that helipoint is not found to be negative and therefore it 
is not kept in the construction of the initial graph. 
This problem could be solved by allowing for the inclusion of 2 x 1 internal loops but 
this would dramatically increase the number of possible helipoints and the running 
time of TT2NE would grow exponentially. 

Limits due to the absence of steric constraints 

We also realized that predicting a pseudoknot is not only a question of free energy 
minimization: steric constraints also matter and some predicted sets of helipoints 
must sometimes be rejected because they do not correspond to any feasible geometry 
in 3D space. For example, we display in Fig. [3] a feature observed in the best secondary 
structure predicted for the sequence Ec_alpha (using a standard diagrammatic 
representation): 

CCUGAAAACGGGCUUUUCAGC [...] UGGCCCGUA 

FIG. 3: Example of a sterically impossible H-pseudoknot 

This pseudoknot is made of two helices respectively drawn in blue and black. Let us 
focus on the seven bases of the 5' strand of the black helix (ACGGGCU). The geometry 
of the nucleotides implies that the pairings organize according to the canonical A-helix 
shape. However, those seven bases also connect the two ends of the blue helix: 
they should therefore make up a hairpin loop. It is clear that these two kinds of 
geometry are mutually exclusive. This diagram therefore cannot match a real RNA 
structure and must be rejected. To create a sterically allowable pseudoknot between 
those regions, one or both helices should be shortened. We thus think that a perfect 
pseudoknot prediction algorithm should be able to include non-maximal helices. This 
necessity is also very well illustrated by the example of the mouse mammary tumor 
virus pseudoknot whose 3D structure has been resolved (PDB entry: 1rnk) [19]. This 
pseudoknot is an H-pseudoknot and one of its helices is non-maximal. By looking at the 
sequence, one could think that one additional wobble pair could form, but from looking 
at the 3D structure, it is clear that due to the peculiar geometry of this pseudoknot, 
the bases of the putative pair are in fact too far from each other to be able to pair. All 
algorithms tested on that sequence wrongly predict this additional pair (sensitivity of 
1 but PPV of 0.91). We have thus chosen by design to include all possible favorable 
non-maximal helipoints in the initial graph that TT2NE generates, even though it 
makes calculations longer. 

In fact, it is worth noticing that whenever a pseudoknot is predicted by TT2NE, its 
PPV is almost always smaller than or equal to its sensitivity. This means that the 
predicted structures are somewhat overloaded with spurious pairings. We examined 
TT2NE's predicted MFE structures and we are convinced that, most of the time, the helipoints 
predicted in excess cannot exist due to steric considerations. This point therefore 
raises an important difference in the evaluation of algorithms for the prediction of 
secondary structures with pseudoknots and without pseudoknots (such as Mfold). For the latter, if 
some modification entails an overall improvement of the sensitivity and the PPV of the 
predicted MFE, then we can conclude that the predictive power of such an algorithm 
has been improved. By contrast, with pseudoknot prediction algorithms, such an 
improvement can be misleading. In fact, the real output to be taken into account is 
not the MFE but the first sterically possible structure. Even if the predicted MFE has 
good sensitivity and PPV, it may happen that the best sterically possible structure is 
in fact completely different and has a bad score. We therefore think that the problem 
of the determination of sterically impossible structures is essential. As long as we do 
not know how to detect impossible structures in a fast and efficient way, pseudoknot 
prediction algorithms may output lots of wrong structures and the evaluation of such 
algorithms with standard statistical estimates such as sensitivity and PPV of the MFE 
is of limited significance. 

The question thus remains: how to deal with steric constraints? To our knowledge 
this is an open question. No clear criterion is known to decide whether a proposed 
pseudoknot is possible or not. For simple H-pseudoknots, where only two helipoints 
are involved, it is an easy task: during the generation of the initial graph, it is sufficient 
to declare two helipoints incompatible if they form a sterically impossible pseudoknot. 
In this version of TT2NE, we have used a simple test depicted in Fig. [4]. However, 
this test is not foolproof as TT2NE still wrongly predicts the wobble pair in the case 
discussed above. 

In this work, despite the lack of an adequate treatment of steric constraints, for every 
studied sequence, we have kept the full initial set of stable helipoints to check how 
it impacts the complexity of the free energy minimization. We also found that 
TT2NE cannot be used for sequences larger than 250 bases on a single standard 
processor unit, because the large number of helipoints makes the calculations too 
long. TT2NE must thus be seen as a tool for pseudoknot prediction, which 
suggests that penalizing pseudoknots according to their genus is a relevant and useful 
concept. As TT2NE builds RNA folds gradually by adding helipoints, as soon as a 
steric constraint verification algorithm becomes available, it will be possible to have an 
on-the-fly procedure that detects sterically impossible structures and stops that 
branch of the search tree. This procedure will greatly improve the output 
of TT2NE and will also speed up the algorithm considerably, since lots of paths 
will no longer be explored. We insist again on the need to tackle the problem of steric 
constraints as a necessary condition to substantially improve the field of pseudoknot 
prediction. 

    h + h > s1 
    h + h > s2 
    min(s1, s2) < 11 (*) 

FIG. 4: Naive stericity tests used in this work for H pseudoknots. The constraint (*) is 
used to prevent the formation of real knots. 

The authors wish to thank A. Capdepon for setting up the TT2NE server at 
http://ipht.cea.fr/rna/tt2ne.php 